A standard deep convolutional neural network paired with a suitable loss function learns compact local image descriptors that perform comparably to state-of-the-art approaches.



1. General Learning Architecture 

Recently, several machine-learning-based approaches (Brown et al., 2010; Simonyan et al., 2012; Trzcinski et al., 2012) have shown impressive results for finding compact low-level image representations. These representations are considered good when corresponding image patches are mapped to representations that lie close together.

DrLim (Hadsell et al., 2006) is a framework for energy-based models that learns representations using only such correspondence relationships. We utilize DrLim to train a convolutional neural network that learns low-dimensional mappings for low-level image patches.

The main idea behind DrLim is to map similar (i.e. corresponding) image patches to nearby points on the output manifold and dissimilar image patches to distant points. It is defined over pairs of image patches, $x_1, x_2$. The $i$-th pair $(x_1^i, x_2^i)$ is associated with a label $y^i$, with $y^i = 1$ if $x_1^i$ and $x_2^i$ are deemed similar and $y^i = 0$ otherwise. We denote by $d(x_1, x_2; \theta)$ the parameterized distance function between the representations of $x_1$ and $x_2$ that we want to learn. Based on $d(x_1, x_2; \theta)$ we define DrLim's loss function $\ell(\theta)$:

$$\ell(\theta) = \sum_i \left[ y^i \, \ell_{\mathrm{pll}}\big(d(x_1^i, x_2^i; \theta)\big) + (1 - y^i) \, \ell_{\mathrm{psh}}\big(d(x_1^i, x_2^i; \theta)\big) \right]$$


We denote with $\ell_{\mathrm{pll}}(\cdot)$ the partial loss function for similar pairs (it pulls similar pairs together) and with $\ell_{\mathrm{psh}}(\cdot)$ the partial loss function for dissimilar pairs (it pushes dissimilar pairs apart). $\ell_{\mathrm{psh}}$ is defined as in (Hadsell et al., 2006):

$$\ell_{\mathrm{psh}}\big(d(x_1, x_2; \theta)\big) = c_{\mathrm{psh}} \left[\max\big(0,\; m_{\mathrm{psh}} - d(x_1, x_2; \theta)\big)\right]^2$$

$m_{\mathrm{psh}}$ is the push margin: dissimilar pairs are not pushed farther apart if they are already at a distance greater than the push margin. $c_{\mathrm{psh}}$ is a scaling factor.

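For $\ell_{\mathrm{pll}}$ we use a loss similar to the hinge loss, in contrast to the loss function proposed in the original DrLim formulation:

$$\ell_{\mathrm{pll}}\big(d(x_1, x_2; \theta)\big) = c_{\mathrm{pll}} \left[\max\big(0,\; d(x_1, x_2; \theta) - m_{\mathrm{pll}}\big)\right]$$

$c_{\mathrm{pll}}$ is a scaling factor and $m_{\mathrm{pll}}$ is a pull margin: similar pairs are pulled together only if they are at a distance above $m_{\mathrm{pll}}$.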

$d(x_1, x_2; \theta)$ is defined as the Euclidean distance between the learned representations of $x_1$ and $x_2$:

$$d(x_1, x_2; \theta) = \lVert f(x_1; \theta) - f(x_2; \theta) \rVert_2$$
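
The pull loss, push loss and Euclidean distance defined above can be put together in a few lines. The following is a minimal sketch in plain Python, not the authors' implementation: it operates on already computed embeddings, it assumes the total loss is the label-weighted sum of the two partial losses over all pairs (the standard DrLim form), and the function names and the hyperparameter values in the usage lines are hypothetical.

```python
import math


def euclidean_distance(f1, f2):
    """Euclidean (L2) distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))


def pull_loss(d, c_pll, m_pll):
    """Hinge-like pull loss: linear in d, zero once d is below the pull margin."""
    return c_pll * max(0.0, d - m_pll)


def push_loss(d, c_psh, m_psh):
    """Squared push loss: zero once d exceeds the push margin."""
    return c_psh * max(0.0, m_psh - d) ** 2


def drlim_loss(pairs, labels, c_pll, m_pll, c_psh, m_psh):
    """Sum the partial losses over pairs (assumed form of the total loss).

    pairs:  list of (f1, f2) embedding pairs
    labels: 1 for similar pairs, 0 for dissimilar pairs
    """
    total = 0.0
    for (f1, f2), y in zip(pairs, labels):
        d = euclidean_distance(f1, f2)
        total += y * pull_loss(d, c_pll, m_pll)
        total += (1 - y) * push_loss(d, c_psh, m_psh)
    return total


# Hypothetical embeddings and hyperparameters, for illustration only.
pairs = [([0.0, 0.0], [3.0, 4.0]),   # similar pair at distance 5
         ([0.0, 0.0], [0.5, 0.0])]   # dissimilar pair at distance 0.5
labels = [1, 0]
loss = drlim_loss(pairs, labels, c_pll=2.0, m_pll=1.0, c_psh=4.0, m_psh=1.0)
```

In this toy example the similar pair contributes 2 · (5 − 1) = 8 and the dissimilar pair contributes 4 · (1 − 0.5)² = 1. Because the pull term is linear rather than squared, a similar pair that is very far apart produces a gradient of constant magnitude instead of one that grows with the distance.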

$f(\cdot)$ denotes the mapping from the (high-dimensional) input space to the low-dimensional space. In this paper, $f$ is a convolutional neural network (Jarrett et al., 2009). The layers of the convolutional network comprise a convolutional layer $C_1$ (kernel size 5×5) with 6 feature maps, a subsampling layer $S_1$, a second convolutional layer $C_2$ (kernel size 6×6) with 21 feature maps, a subsampling layer $S_2$, a third convolutional layer $C_3$ (kernel size 5×5) with 55 feature maps and a fully connected layer with 32 units.
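
To see how the layer sequence shrinks a patch down to a descriptor, one can trace the spatial size of the feature maps. The sketch below uses the kernel sizes and feature-map counts given above. The text does not state the network input size, the subsampling factors, or the convolution stride and padding, so the 64×64 input, the factor-3 subsampling, and the unpadded stride-1 convolutions are all assumptions made only for illustration.

```python
def conv_out(size, kernel):
    """Side length after a stride-1 convolution without padding (assumed)."""
    return size - kernel + 1


def pool_out(size, factor):
    """Side length after non-overlapping subsampling by `factor` (assumed)."""
    return size // factor


def trace_shapes(input_size=64, pool1=3, pool2=3):
    """List (layer, feature maps, side length) for the layers named in the text.

    Kernel sizes and feature-map counts come from the text; input_size,
    pool1 and pool2 are hypothetical values chosen for illustration.
    """
    shapes = []
    size = conv_out(input_size, 5)
    shapes.append(("C1", 6, size))
    size = pool_out(size, pool1)
    shapes.append(("S1", 6, size))
    size = conv_out(size, 6)
    shapes.append(("C2", 21, size))
    size = pool_out(size, pool2)
    shapes.append(("S2", 21, size))
    size = conv_out(size, 5)
    shapes.append(("C3", 55, size))
    return shapes


for name, maps, size in trace_shapes():
    print(f"{name}: {maps} maps of {size}x{size}")
```

Under these assumed settings the side length goes 60, 20, 15, 5, 1, so $C_3$ would produce 55 maps of size 1×1 and the fully connected layer would map 55 inputs to the 32-dimensional descriptor. Other input sizes or subsampling factors give different intermediate shapes.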



2. Experiments 

We evaluate our proposed model on the dataset from (Brown et al., 2010). The dataset is based on more than 1.5 million image patches (64 × 64 pixels) of three different scenes: the Statue of Liberty (about 450,000 patches), Notre Dame (about 450,000 patches) and Yosemite's Half Dome (about 650,000 patches). We denote these scenes with LY, ND and HD respectively. There are 250,000 corresponding image patch pairs and 250,000 non-corresponding image patch pairs available for every scene. We train on one scene and evaluate the learned embedding function on the other two scenes. Evaluation is done on the same test sets (50,000 matching and non-matching pairs) that are used by other approaches.

Table 1 shows that convolutional networks (last entry) perform comparably to other state-of-the-art approaches. The appeal of a simple parametric model like a convolutional neural network is that it does not require any complex parameter tuning or pipeline optimization and that it can be integrated into larger systems that can then be trained in an end-to-end fashion (Hadsell, 2008).

The architecture is trained with standard gradient descent. Training stops when a local minimum of the DrLim objective is reached. Notably, the hyperparameters ($c_{\mathrm{pll}}$, $m_{\mathrm{pll}}$, $c_{\mathrm{psh}}$, $m_{\mathrm{psh}}$) used in our evaluation are not scene dependent.

3. More data 

Convolutional neural networks benefit from abundant data (Ciresan et al., 2012; Krizhevsky et al., 2012). Utilizing data from two scenes improves error rates noticeably: we get 15.1% on LY with combined training on ND and HD (in total 1M patch pairs). Similarly, we get 8.5% on ND and 14.3% on HD.
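
                                    Test set
Method            Tr. set     LY          ND          HD
------------------------------------------------------------
SIFT              -           31.7        22.8        25.6
------------------------------------------------------------
L-BGM             LY          -           14.1        19.6
(64d)             ND          18.0        -           15.8
                  HD          21.0        13.7        -
------------------------------------------------------------
Brown et al.      LY          -           x           x
(29d)             ND          16.8        -           13.5
                  HD          18.2        11.9        -
------------------------------------------------------------
Simonyan et al.   LY          -           x           x
(29d)             ND          14.5        -           12.5
                  HD          17.4        9.6         -
------------------------------------------------------------
CNN               LY          -           11.2±0.3    18.5±0.5
(32d)             ND          16.4±0.3    -           16.2±0.3
                  HD          18.9±0.4    10.7±0.2    -
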

Table 1. Error rates, i.e. the percentage of incorrect matches when 95% of the true matches are found. Every subtable, indicated by an entry in the Method column, denotes a descriptor algorithm. The line below every method denotes the size of the descriptor (e.g. 32d denotes a 32-dimensional descriptor). The 128-dimensional SIFT descriptor (Lowe, 2004) does not require learning (denoted by - in the column Tr. set, i.e. training set). The numbers in the columns labeled LY, ND and HD are the error rates of a method on the respective test set for this scene. Brown et al. (2010) and Simonyan et al. (2012) do not report results when trained on the LY scene (indicated by x). L-BGM is presented in (Trzcinski et al., 2012). The mean error rates for convolutional neural networks (CNN) are given with the standard deviation over 10 runs.
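
The error measure in the caption can be made concrete with a short sketch. It assumes that "incorrect matches" means non-matching pairs whose descriptor distance falls at or below the threshold that accepts 95% of the matching pairs (the false positive rate at 95% recall), which is one common reading of this metric. The function name and the toy data are hypothetical.

```python
import math


def error_rate_at_recall(distances, labels, recall=0.95):
    """Percentage of non-matching pairs accepted at the given recall.

    distances: descriptor distance for each pair
    labels:    1 for matching pairs, 0 for non-matching pairs
    The threshold is the smallest distance that accepts at least
    `recall` of the matching pairs.
    """
    pos = sorted(d for d, y in zip(distances, labels) if y == 1)
    neg = [d for d, y in zip(distances, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one matching and one non-matching pair")
    # Number of matching pairs that must be accepted; the small epsilon
    # guards against floating-point round-up in recall * len(pos).
    k = max(1, math.ceil(recall * len(pos) - 1e-9))
    threshold = pos[k - 1]
    false_pos = sum(1 for d in neg if d <= threshold)
    return 100.0 * false_pos / len(neg)


# Toy data: 20 matching pairs at distances 1..20, 5 non-matching pairs.
distances = list(range(1, 21)) + [5, 18.5, 19, 19.5, 25]
labels = [1] * 20 + [0] * 5
rate = error_rate_at_recall(distances, labels)
```

With the toy data, 95% recall requires accepting 19 of the 20 matching pairs, which sets the threshold to 19. Three of the five non-matching pairs fall at or below it, so the error rate is 60%.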